Abstract


We  discovered  a  surprising  law  governing  the  spatial  join  selectivity  across  two  sets  of  points.  An 
example  of  such  a  spatial  join  is  "find  the  libraries  that  are  within  10  miles  of  schools".  Our  law 
dictates that the number of such qualifying pairs follows a power law, whose exponent we call the "pair-count exponent" (PC). We show that this law also holds for self-spatial-joins ("find schools within
5  miles  of  other  schools")  in  addition  to  the  general  case  that  the  two  point-sets  are  distinct.  Our  law 
holds  for  many  real  datasets,  including  diverse  environments  (geographic  datasets,  feature  vectors 
from  biology  data,  galaxy  data  from  astronomy). 

In  addition,  we  introduce  the  concept  of  the  Box-Occupancy-Product-Sum  (BOPS)  plot,  and  we 
show  that  it  can  compute  the  pair-count  exponent  in  a  timely  manner,  reducing  the  run  time  by  orders 
of  magnitude,  from  quadratic  to  linear.  Due  to  the  pair-count  exponent  and  our  analysis  (Law  1),  we 
can achieve accurate selectivity estimates in constant time (O(1)) without the need for sampling or
other  expensive  operations.  The  relative  error  in  selectivity  is  about  30%  with  our  fast  BOPS  method, 
and  even  better  (about  10%),  if  we  use  the  slower,  quadratic  method. 


1.  INTRODUCTION 


Multi-dimensional  and  spatial  database  management  systems  (DBMS)  have  attracted  a  lot  of  interest.  One  of 
the  most  important  operations  in  a  spatial  DBMS  [0*94]  is  the  spatial  join,  which  is  the  counterpart  to  the  equi- 
join  in  a  relational  DBMS. 

The typical query is also called the ‘all pairs’ query or ‘spatial distance join’, as in the example, ‘Estimate the number of schools that are within 5 miles of libraries’. Spatial distance joins are considered to be among the most essential joins in application areas like data mining [cmn99] [nh94]. They are useful in multiple settings, such as the following.

• In geographic information systems (GIS) under the name of overlay queries: for example, ‘Find all houses within 2 miles of a river’.

• In urban planning, business planning, commercial intelligence: ‘How many households are within 1 mile of our branches and from our competition's branches’.

• In spatial data mining to detect correlations and test hypotheses: for example, ‘Find 4-bedroom houses that are within 5 miles of a school’, or ‘How many luxury apartments are within 2 miles of a lake’ [nh94].

• In temporal data mining: ‘Find economic embargos that were followed by war within a year’, or ‘Find network-switch failures that were within 5 seconds of a power surge’ [mtv95] [hkm+96].

• In multimedia and traditional databases: ‘Find pairs of stock price changes that are within $10 of each other’ [frm94].

The spatial distance join is defined using two spatial data sets, A and B, and a distance function L. For a given radius r, the spatial distance join computes $\{ \langle a, b \rangle \mid a \in A \text{ and } b \in B,\ L(a, b) \le r \}$. A special case arises
when  the  two  datasets,  A  and  B  are  identical.  Such  joins  will  be  qualified  as  ‘self  spatial  joins’.  We  will  use 
the  term  ‘cross  spatial  joins’,  when  we  need  to  emphasize  that  the  two  point  sets  are  distinct.  Otherwise,  we 
will  simply  use  the  term  ‘spatial  join’  to  denote  a  spatial  distance  join  between  two  distinct  datasets. 
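
As a minimal sketch of the definition above, the following Python function computes a spatial distance join by brute force. The names `distance_join` and `chebyshev`, the sample coordinates, and the choice of the maximum-coordinate distance as the default L are illustrative assumptions, not part of the paper; any distance function could be passed instead.

```python
from itertools import product
from typing import Callable, Sequence

Point = Sequence[float]


def chebyshev(a: Point, b: Point) -> float:
    """Maximum coordinate difference (an illustrative choice of L)."""
    return max(abs(x - y) for x, y in zip(a, b))


def distance_join(A, B, r, dist: Callable[[Point, Point], float] = chebyshev):
    """Return all pairs (a, b), a in A and b in B, with dist(a, b) <= r.

    Brute force: examines every one of the len(A) * len(B) candidate pairs.
    """
    return [(a, b) for a, b in product(A, B) if dist(a, b) <= r]


# Hypothetical coordinates, for illustration only.
schools = [(0.0, 0.0), (5.0, 5.0)]
libraries = [(1.0, 0.5), (9.0, 9.0), (5.5, 4.0)]
pairs = distance_join(schools, libraries, r=1.0)
```

The quadratic cost of this approach is what makes an inexpensive estimate of the result size attractive.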

The  goal  of  this  work  is  to  estimate  the  selectivity  of  spatial  joins  among  two  datasets  as  opposed  to  only 
one.  The  join  selectivity  represents  the  size  of  the  resultant  set  of  the  spatial  distance  join  divided  by  the  size 
of  the  Cartesian  product  of  the  whole  data.  Estimation  of  the  join  selectivity  is  important  for  the  following  two 
reasons. 

•  An  accurate  estimation  is  necessary  to  optimize  complex  queries.  Though  there  has  been  quite  a  lot  of 
work  done  on  how  to  estimate  the  selectivity  of  equi-joins,  the  problem  of  estimating  the  size  of  spatial  joins 
has  received  only  minimal  attention  up  to  now. 

•  In  application  areas  like  the  ones  mentioned  earlier,  the  size  of  the  spatial  distance  join  (as  a  function  of 
the  radius)  is  important  for  evaluating  the  correlation  between  datasets.  Note  that  it  is  generally  too  costly 
to  obtain  the  size  of  the  spatial  join  by  simply  computing  the  spatial  distance  join  itself.  Therefore,  an 
accurate  and  inexpensive  method  is  required  to  estimate  the  size  of  spatial  distance  joins. 

Our main contribution is that we observe a ‘power law’, which holds for many pairs of real datasets. We show how to use this power law to accurately estimate the spatial join selectivities efficiently (in constant time, O(1)).

The  rest  of  the  paper  is  organized  as  follows.  Section  2  presents  the  related  work.  Section  3  describes  our 
main contribution, the pair-count exponent $\mathcal{P}$, and the fast way to estimate it, through the proposed box-occupancy-product-sum (BOPS). Section 4 discusses implementation and speed issues of the proposed
methods.  Section  5  gives  experimental  results,  and  Section  6  discusses  issues  for  practitioners.  Section  7 
presents  the  conclusions. 
2. RELATED WORK

There has been quite a lot of work on spatial joins recently. See, for example, [ore86], [bks93], [lr94], [pd96], [ks97], [apr+98] and [mp99]. Most of the mentioned work has dealt with developing efficient methods to process spatial intersection joins for two-dimensional data sets [bsw99] [dns91] [sk96], with little emphasis on the estimation of selectivity. Recently, methods have also been examined and developed for processing spatial distance joins on multidimensional point sets [ssa97], [ksosj. The term “similarity join” has frequently also been used for spatial distance joins in the literature. For one-dimensional data, the spatial distance join corresponds to the ‘band-join’ [dns91].

Although not directly related to our spatial join selectivity, we mention earlier attempts to estimate the selectivity of range queries. Typical methods include the milestone ‘uniformity and independence’ assumptions [sac+79]. Although simple to use in a query optimizer, these assumptions are pessimistic and unrealistic [cum].

Modern methods include histograms [poo97], kernel estimators [bks99], wavelets [vw99], and hybrid methods using query feedback [kw99]. Methods for selectivity estimation of range queries in spatial datasets use multidimensional histograms [ts96], or arguments from the theory of fractals [bf95]. It should be noted that most of these methods are susceptible to the ‘dimensionality curse’ [sil96] [scowi.

Analytical  estimates  of  spatial  distance  join  selectivities  are  few.  The  very  recent  work  presented  in  [pmt99] 
assumed  the  data  are  uniformly  distributed  in  the  address  space.  As  mentioned  earlier,  the  uniformity 
assumption  was  discredited  long  ago  [chr84],  [fk94]  as  unrealistic  and  unfeasible.  Our  experiments  in  Section  5 
indeed show that it is unrealistic. The cost model presented in [tss98] was built for non-uniformly distributed datasets using R-tree-based structures.

In  the  next  sections  we  proceed  with  our  proposed  solution.  The  major  observation  is  that  the  selectivity 
of  spatial  distance  joins  follows  a  power  law  surprisingly  well. 

3.  PROPOSED  METHOD 

Our  main  contribution  and  its  corollaries  are  discussed  below.  The  problem  to  be  solved  is  the  following. 

Given:  two  point-sets  A  and  B  and  a  radius  r 

Find:  the  distribution  of  the  count  of  pairs,  as  a  function  of  the  radius  r . 

That is, is this distribution Gaussian? Is it Poisson? Is it Weibull? It turns out that real datasets do not follow
any  of  the  traditional  statistical  distributions.  Instead,  we  show  that  the  distribution  of  the  pair-wise  distances 
follows  a  power  law.  Table  1  lists  symbols  used  in  this  document.  Next,  we  describe  our  power  law,  as  well 
as  several  useful  properties  of  its  exponent. 

3.1.  Pair-count  function  and  the  PC  exponent 

We propose to study the probability distribution function of the number of pairs as a function of the distance between those pairs. Specifically, we define and study the pair-count function $PC_{A,B}(r)$, or simply $PC(r)$, of two point-sets A and B used in a spatial join query. It is defined as follows.

Definition 1: For two point-sets A and B, we define $PC_{A,B}(r)$ as the pair-count function, that is, the count of pairs within distance r or less. The first member of the pair should belong to point set A, and the second member to point set B.

$$PC_{A,B}(r) = \text{count of A-B pairs within distance} \le r$$

Some  observations  are  helpful: 

• Our $PC(r)$ function roughly corresponds to the ‘cumulative distribution function’ from statistics.

• We typically omit the subscripts A, B for simplicity.

• The implied distance function can be any $L_p$ norm. We use the $L_\infty$ norm unless otherwise specified. The reason is that all the upcoming results hold for any $L_p$ norm, but the formulas are simpler for the $L_\infty$ norm.

• For a self spatial join (i.e., A = B) we omit the self-pairs, and we count each pair only once. That is, if there are N points in the set, we consider N*(N-1)/2 pairs. Again, the upcoming results can be easily adapted to handle any of the omitted cases.
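
The pair-count function and the self-join convention described above can be sketched as follows. This is a brute-force illustration that assumes NumPy and the $L_\infty$ norm; the function names `pair_count` and `pair_count_self` are hypothetical. It materializes the full distance matrix, so it is only meant for small inputs.

```python
import numpy as np


def pair_count(A, B, r):
    """Cross join: number of (a, b) pairs with L-infinity distance <= r."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # |A[i, k] - B[j, k]| maximized over the coordinate k is the L-infinity distance.
    dist = np.abs(A[:, None, :] - B[None, :, :]).max(axis=2)
    return int((dist <= r).sum())


def pair_count_self(A, r):
    """Self join: skip self-pairs and count each unordered pair once."""
    A = np.asarray(A, dtype=float)
    dist = np.abs(A[:, None, :] - A[None, :, :]).max(axis=2)
    i, j = np.triu_indices(len(A), k=1)  # the N*(N-1)/2 index pairs with i < j
    return int((dist[i, j] <= r).sum())
```

Evaluating either function for a range of radii gives the points of a pair-count plot at a cost of N*M distance computations per radius.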

For  reasons  that  will  soon  be  obvious,  we  define  the  concept  of  the  pair-count  plot: 

Definition 2: The pair-count plot, or simply PC-plot, for two point sets A and B is the plot of $PC_{A,B}(r)$ versus r, in log-log scales.

Figure 1 presents (a) a pair-count plot for real datasets in linear scales, and (b) the same pair-count plot in log-log scales. The datasets are explained in Section 5. The question is whether such functions obey any rule. It turns out that many of them indeed follow a law, specifically a power law, as we discuss next. The experiments we have done with many real datasets show that many of them result in a PC-plot that is almost linear (within 1.5% MLS error and typically less) for a suitable range of distances r (radius from $r_1$ to $r_2$). Considering this, we present our major result.


Figure  1  -  The  Pair-count  plot  of  California  datasets  (CA-str  cross  joined  with  CA-wat)  (a)  linear 
scales,  and  (b)  log-log  scales 


Law 1 (PAIR-COUNT): For several real datasets and for a usable range of scales, the pair-count $PC(r)$ of pairs within distance r or less follows a power law:

$$PC(r) = K \cdot r^{\mathcal{P}}$$

where K is a proportionality constant. Equivalently, Definition 3 follows.

Definition 3: The exponent of the law is defined as the pair-count exponent $\mathcal{P}$, as

$$\mathcal{P} = \frac{\partial \log(PC(r))}{\partial \log(r)} \tag{2}$$
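
Law 1 implies that log PC(r) is a linear function of log r, with the exponent as slope and log K as intercept, so both constants can be read off a least-squares line on the log-log plot. A minimal sketch, assuming NumPy; the function name `fit_power_law` and the synthetic data are hypothetical.

```python
import numpy as np


def fit_power_law(radii, counts):
    """Fit counts ~ K * radii**P by least squares on the log-log plot.

    Returns (P, K). All radii and counts must be positive.
    """
    log_r = np.log(np.asarray(radii, dtype=float))
    log_pc = np.log(np.asarray(counts, dtype=float))
    slope, intercept = np.polyfit(log_r, log_pc, deg=1)
    return float(slope), float(np.exp(intercept))


# Synthetic check: counts generated from an exact power law are recovered.
radii = np.array([0.01, 0.02, 0.04, 0.08, 0.16])
P, K = fit_power_law(radii, 500.0 * radii ** 1.7)
```

On real data the fit should be restricted to the range of radii over which the plot is actually linear.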
Figure 1(b) shows the pair-count plot for the same pair of datasets as Figure 1(a) in log-log scales. The plot is clearly linear for a significant range of scales. This range is usually the one most sought after for queries; we are not interested in radii much smaller or larger than the typical distances involved in the dataset.

Figure 2 shows PC-plots and fitting lines for two cross-joins of California datasets: (a) streets cross joined with railroads and (b) streets cross joined with water. The description of these datasets and additional $PC(r)$ plots are shown later in Section 5, which deals with our experiments.


Figure 2 - PC-plots, slopes of the fitting lines, and the pair-count exponent for two pairs of California datasets: (a) streets cross joined with railroads; (b) streets cross joined with water. Horizontal axes: log(dist).

3.2. Properties of the pair-count exponent $\mathcal{P}$

The following observations show some of the interesting properties of the pair-count exponent $\mathcal{P}$.

• Observation 1: The pair-count exponent $\mathcal{P}$ includes the “correlation fractal dimension” $D_2$ as a special case.

Justification: When the second dataset is identical to the first, the PC exponent is, by definition, equal to the “correlation fractal dimension” [belussijsi. Intuitively, this is the ‘intrinsic’ dimensionality of the dataset.

Figure 3 - Illustration of the effects of sampling on the pair-count exponent $\mathcal{P}$. The PC-plots for the full datasets and for 20%, 10% and 5% samples: (a) California pol X wat and (b) Galaxy dev X exp. Horizontal axes: log(dist).

• Observation 2: The pair-count exponent $\mathcal{P}$ is invariant to affine transformations, namely to translation, rotation, and uniform scaling.

Justification:  By  ‘uniform  scaling’  we  mean  that  all  the  axes  are  scaled  by  the  same  amount.  Translation 
and  rotation  do  not  affect  the  distances  and  thus  leave  the  plots  unchanged.  Uniform  scaling  scales  all  the 
distances,  and  thus  shifts  the  plot  to  the  left  or  the  right.  Its  slope,  however,  remains  the  same. 
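
As a short worked step, assume Law 1 holds with constants K and $\mathcal{P}$ and that every coordinate is multiplied by a factor c > 0 (the symbol c and the notation $PC'$ for the pair-count of the scaled data are introduced here for illustration). Every distance is then multiplied by c, so the pairs within distance r in the scaled data are exactly the pairs within distance r/c in the original data:

```latex
\begin{align*}
PC'(r) &= PC(r/c) = K \cdot (r/c)^{\mathcal{P}} = \left(K c^{-\mathcal{P}}\right) \cdot r^{\mathcal{P}} \\
\log PC'(r) &= \mathcal{P} \log r + \log K - \mathcal{P} \log c
\end{align*}
```

Only the constant term changes; the slope $\mathcal{P}$ is the same before and after scaling.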

• Observation 3: The pair-count exponent $\mathcal{P}$ is invariant to sampling.

Justification: Sampling is useful when we deal with large datasets, although our upcoming BOPS algorithm can handle huge datasets even better. It is useful that our power law holds for subsets of our data. The intuitive argument is as follows. Consider a dataset A with N points and a sampling rate $p_a$ ($0 < p_a < 1$), that is, the sample has $N \cdot p_a$ points. Similarly, let M be the number of points in dataset B, and let $p_b$ be its sampling rate. Consider a point $a_i$ from the dataset A and let $a_i(r)$ be the number of its B-type neighbors within distance r. After sampling, it will have $a_i(r) \cdot p_b$ neighbors on the average. Since only a fraction $p_a$ of the points of A is kept, the total number of pairs in the two samples within distance r will be the original $PC(r)$ times $p_a \cdot p_b$ on the average. This will not change the slope of the PC-plot: it will only lower the position of the plot, by $\log(p_a \cdot p_b)$.
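
The argument above can be written out as a worked equation, assuming Law 1 holds for the full datasets (the notation $PC_{sample}$ and the sampling rates in the example are illustrative):

```latex
\begin{align*}
E[PC_{sample}(r)] &= p_a \cdot p_b \cdot PC(r) = (p_a p_b K) \cdot r^{\mathcal{P}} \\
\log E[PC_{sample}(r)] &= \mathcal{P} \log r + \log K + \log(p_a p_b)
\end{align*}
% Example: p_a = p_b = 0.1 gives \log_{10}(p_a p_b) = \log_{10}(0.01) = -2,
% i.e. the plot drops by two decades while its slope is unchanged.
```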

Figure  3  shows  the  PC(r )  plots  for  two  pairs  of  datasets.  In  (a)  it  shows  California  political  cross  joined 
with  California  water  and  in  (b)  it  shows  Galaxy-dev  cross-joined  with  Galaxy-exp,  as  well  as  their  20%,  10% 
and  5%  samples.  Notice  that  the  plots  are  linear,  and  those  corresponding  to  samples  are  parallel  to  the  full 
dataset. Tables 3 and 4 summarize their $\mathcal{P}$ values.

• Observation 4: The pair-count exponent $\mathcal{P}$ is invariant to the $L_p$ distance used.

Justification: Consider the ‘sphere’ that each $L_p$ metric defines (see Figure 4). Let $vol(p, r)$ be the volume of an n-dimensional $L_p$ ‘sphere’ of radius r. For p = 2, this is indeed a sphere; for p = infinity this is an n-dimensional cube, etc. Our power law states that the number of type-B neighbors of a type-A point grows as $r^{\mathcal{P}}$ or, equivalently, as $\text{volume}^{\mathcal{P}/E}$, where E is the embedding dimensionality. Then, if $PC(r, L_p)$ denotes the number of neighbors within $L_p$ distance r, we have:

$$PC_{A,B}(r, L_p) = PC_{A,B}(r, L_\infty) \cdot \left( \frac{vol(p, r)}{vol(\infty, r)} \right)^{\mathcal{P}/E} \tag{3}$$

Therefore, the number of pairs will only differ by a multiplicative constant for different values of p in the $L_p$ metric. Figure 5 shows the effect of norm invariance on the cross join of two California datasets (political and water). It is clear that the three $L_p$ metrics chosen result in parallel lines. Therefore, for the rest of this work, we will only focus on the $L_\infty$ metric. We can conclude that the pair-count exponent shows an intrinsic property of the two point-sets, and it is independent of the particular $L_p$ distance function used to build the PC plot.

Figure 4 - $L_\infty$, $L_1$ and $L_2$ norms in 2-d.

Figure 5 - Effects of the distance functions to obtain PC-plots.
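
As a worked instance of Equation (3) in two dimensions (E = 2), the areas of the three ‘spheres’ are elementary, and their ratios do not depend on r. The computation below is an illustration under that assumption, not a result reported in the paper:

```latex
% Areas of the 2-d "spheres" of radius r:
\begin{align*}
vol(\infty, r) &= (2r)^2 = 4r^2 && \text{(square of side } 2r\text{)} \\
vol(2, r) &= \pi r^2 && \text{(disk)} \\
vol(1, r) &= 2r^2 && \text{(diamond with diagonals } 2r\text{)}
\end{align*}
% Substituting into Equation (3):
\begin{align*}
PC(r, L_2) &= PC(r, L_\infty) \cdot (\pi/4)^{\mathcal{P}/2} \\
PC(r, L_1) &= PC(r, L_\infty) \cdot (1/2)^{\mathcal{P}/2}
\end{align*}
```

Both factors are constants, so in log-log scales the three plots are vertical translations of one another.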

4.  IMPLEMENTATION  AND  SPEED  ISSUES 

By the definition of the ‘pair-count exponent’, we need to estimate the pair-counts for several distances r. Each of them requires O(N*M) operations, which is quadratic in the size of the input datasets. This is prohibitive for large datasets. The question becomes: how can we accelerate the computation of $\mathcal{P}$? This is precisely the topic of this Section.


Figure 6 - A grid superimposed over a point-set to count $C_{A,i}$ and $C_{B,i}$


4.1. A faster way to compute the ‘pair-count exponent’ $\mathcal{P}$

Here we give a Lemma that allows the pair-count exponent to be computed in O(N+M) time, and thus dramatically faster for huge datasets. A crucial concept that we introduce is the Box-Occupancy-Product-Sum (BOPS), which is defined as follows. Consider the address space of two point-sets in an n-dimensional space, and impose an n-dimensional grid with grid-cells of side s (or, equivalently, radius r = s/2). Focusing on the i-th cell, let $C_{A,i}$, $C_{B,i}$ be the counts (‘occupancies’) of points from the first and from the second point-set, respectively, as illustrated in Figure 6.

Definition 4: The "Box-Occupancy-Product-Sum" (BOPS) of a grid with cell side s is defined as the sum of products of occupancies as

$$BOPS(s) = \sum_i C_{A,i} \cdot C_{B,i} \tag{4}$$

and the BOPS plot is the plot of BOPS(s) as a function of the grid side s, in log-log scales.

Lemma 1 (BOPS): The pair-count $PC(r)$ for a given radius is approximately equal to the box-occupancy-product-sum (BOPS) for the doubled radius; that is

$$PC(s/2) \approx BOPS(s) \tag{5}$$
Proof: The fundamental assumption is that the densities of points are smooth functions. Thus, if a point $p_1$ of set A has x neighbors from the set B within radius r, so does a close-by neighbor $p_2$ that also belongs to set A.

Thus, for a given cell side s and a given cell (say, the i-th one), consider one of the points of the set A. This point has a number of neighbors from the set B within radius s/2 that is proportional to $C_{B,i}$. Thus, the i-th cell contributes with

$$C_{A,i} \cdot C_{B,i} \tag{6}$$

pairs. Adding up the contributions of all the cells, we have

$$PC(s/2) \approx \sum_i C_{A,i} \cdot C_{B,i} \tag{7}$$

which completes the proof.

QED


Corollary: The BOPS follows a power law with its exponent equal to the "pair-count exponent".

$$BOPS(s) \propto s^{\mathcal{P}} \tag{8}$$

Proof:  Trivial,  from  Lemma  1  and  Law  1. 

QED 
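
As a sanity check of the Lemma and its Corollary, consider the idealized case of two independent, uniformly distributed point-sets with N and M points in the unit n-dimensional cube. This uniform setting is an assumption made here only for illustration. A grid of side s has $s^{-n}$ cells, and the expected occupancies are $N s^n$ and $M s^n$:

```latex
\begin{align*}
E[BOPS(s)] &= \sum_i E[C_{A,i}] \cdot E[C_{B,i}] = s^{-n} \cdot (N s^n) \cdot (M s^n) = N M s^n \\
PC(s/2) &\approx N \cdot M \cdot (2 \cdot s/2)^n = N M s^n \qquad (L_\infty,\ \text{boundary effects ignored})
\end{align*}
```

Both quantities grow as $s^n$, so in this case the BOPS plot and the PC plot share the slope n, the embedding dimensionality.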

We  are  going  to  use  the  estimation  PC(r)  =  BOPS(2r)  for  the  rest  of  this  work.  The  ‘BOPS’  Lemma  has 
important  efficiency  implications  which  are  vital  for  large  datasets.  Next  we  show  how  to  use  this  Lemma  for 
fast  selectivity  estimations. 

4.2.  Algorithms 

The  problem  is  defined  as  follows. 

Given two point-sets A and B in n-dimensional space,

Estimate their pair-count exponent $\mathcal{P}$ and the proportionality constant K.

We developed a single-pass algorithm to obtain the BOPS plot. Specifically, the algorithm is linear, O(N+M), in the total number of points in both datasets. If l is the number of points that we want in the BOPS plot (i.e., the number of grid-sizes), then the complexity of our algorithm is O((N+M)*l*n), where n is the dimensionality of the input point-sets. Figure 7 gives a brief algorithm to generate the BOPS-plot and the estimate of the pair-count exponent.

4.3.  Estimation  of  selectivity 

Here  we  describe  exactly  how  to  estimate  the  spatial  join  selectivities,  exploiting  our  two  major  observations, 
the  pair-count  law  and  the  BOPS  lemma.  More  specifically,  the  problem  is  as  follows. 

Given  two  point-sets  A  and  B,  and  a  radius  r. 

Estimate  the  count  of  pairs  PC(r). 

We  distinguish  the  following  methods,  depending  on  what  else  we  are  given: 

• PC plot estimation: Through previously kept statistics on the PC plot, suppose that we already know the pair-count exponent $\mathcal{P}$ and the proportionality constant K. Then we immediately estimate the pair count as $PC(r) = K \cdot r^{\mathcal{P}}$.

• BOPS plot estimation: We assume that we are given only the datasets, without any statistics about the data. Then, we generate the BOPS plot for several values of grid-side s, and we estimate the slope $\mathcal{P}$ and the constant K, as explained in the algorithm in Figure 7. Notice that we not only obtain our estimate, but we also provide $\mathcal{P}$ and K for future queries.
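
Once $\mathcal{P}$ and K are stored, each estimate is a single evaluation of Law 1, which is what makes it a constant-time operation. A minimal sketch, in which the function names and the numeric values of K and P are hypothetical placeholders rather than figures from the paper:

```python
def estimate_pair_count(r, K, P):
    """Law 1: estimated number of pairs within distance r, PC(r) = K * r**P."""
    return K * r ** P


def estimate_selectivity(r, K, P, n_a, n_b):
    """Estimated pair count divided by the size of the Cartesian product."""
    return estimate_pair_count(r, K, P) / (n_a * n_b)


# Hypothetical statistics kept for a pair of datasets with 4 and 5 points.
selectivity = estimate_selectivity(r=0.5, K=8.0, P=2.0, n_a=4, n_b=5)
```

The estimate is only meaningful for radii inside the range over which the power law was fitted.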

Without loss of generality, due to Observation 2,
Normalize the address space of the datasets to the unit hypercube;
For each desirable grid-size $s = 1/2^j$, j = 1, 2, ..., l:
    For each point a of dataset A:
        Decide which grid cell it falls in (say, the i-th cell);
        Increment the count $C_{A,i}$;
    For each point b of dataset B:
        Decide which grid cell it falls in (say, the i-th cell);
        Increment the count $C_{B,i}$;
    Compute the sum of product occupancies $BOPS(s) = \sum_i C_{A,i} \cdot C_{B,i}$;
    Print the values of log(s/2) and log(BOPS(s)) as the BOPS-plot;
Fit a line to the BOPS-plot and report the slope $\mathcal{P}$ and the proportionality constant K.

Figure 7 - Algorithm for calculating BOPS plots.
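
The steps of Figure 7 can be sketched in Python as follows. This is an illustrative implementation rather than the authors' code: it assumes NumPy, assumes that the points have already been normalized to the unit hypercube, and uses hypothetical function names (`bops`, `bops_exponent`). Occupancies are kept in hash-based counters so that each grid size needs one pass over each dataset.

```python
from collections import Counter

import numpy as np


def bops(A, B, s):
    """Box-occupancy-product-sum for grid cells of side s.

    A and B hold points of shape (N, n) and (M, n), already scaled to
    the unit hypercube [0, 1]^n.
    """
    cells_per_axis = int(round(1.0 / s))

    def occupancies(points):
        idx = (np.asarray(points, dtype=float) / s).astype(np.int64)
        # Points lying exactly on 1.0 are assigned to the last cell.
        idx = np.minimum(idx, cells_per_axis - 1)
        return Counter(map(tuple, idx.tolist()))

    occ_a, occ_b = occupancies(A), occupancies(B)
    # Only cells occupied by both sets contribute to the sum.
    return sum(k * occ_b[c] for c, k in occ_a.items() if c in occ_b)


def bops_exponent(A, B, levels):
    """Fit log(BOPS(s)) against log(s/2) for s = 1/2, 1/4, ..., 1/2**levels.

    Returns (P, K), the slope and the constant of PC(r) ~ K * r**P.
    Grid sizes whose BOPS is zero are skipped (their log is undefined).
    """
    xs, ys = [], []
    for j in range(1, levels + 1):
        s = 1.0 / 2 ** j
        value = bops(A, B, s)
        if value > 0:
            xs.append(np.log(s / 2))
            ys.append(np.log(value))
    slope, intercept = np.polyfit(xs, ys, deg=1)
    return float(slope), float(np.exp(intercept))
```

For example, a regular 16 x 16 lattice joined with itself behaves like a uniform 2-d dataset, and `bops_exponent(lattice, lattice, levels=4)` returns a slope of 2.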

An  obvious  trick  to  approximate  the  BOPS  plot  is  to  do  sampling  first.  We  discuss  its  relative  merit  in 
Section  5. 


5.  EXPERIMENTS 

Figure 8 - Real data used in the experiments. (a) California: CA-pol and CA-wat (2-dimensional point-sets), (b) Iris: setosa, versicolor and virginica (4-dimensional point-sets) and (c) Galaxy: class dev and exp (2-dimensional point-sets).

We implemented our method and checked whether the power law holds for different data sets. For the sake of clarity we named the datasets used in the experiments. Point-sets come in groups; thus, each dataset is characterized by its group name, a dash and the dataset name. Their characteristics are as follows.

California - Two-dimensional sets of points that refer to geographical coordinates in California (see Figure 8(a)). The four files contain data features from streets (CA-str with 62,933 points), railways (CA-rai with
31,059  points),  political  borders  (CA-pol  with  46,850  points),  and  natural  water  systems  (CA-wat,  with 
1 172,066  points)  [censqj. 

Iris  -  This  set  contains  three  files,  each  of  which  describes  a  few  properties  of  a  specific  flower  type  of  Iris. 
The  points  are  4-dimensional  (sepal  length,  sepal  width,  petal  length,  petal  width);  the  species  are 
‘virginica’,  ‘versicolor’  and  ‘setosa’,  and  there  are  50  points  from  each  species.  This  is  a  well-known 
dataset in the literature of machine learning and statistics, which we obtained from the UC-Irvine
Repository  (see  Figure  8(b)). 

Galaxy  -  Galaxies  come  from  the  SLOAN  telescope:  (x,y)  coordinates,  plus  class  label  (see  Figure  8(c)). 
There  are  82,277  in  the  ‘dev’  class  (deVaucouleurs),  and  70,405  in  the  ‘exp’  class  (exponential). 

Eigenfaces - Two datasets (‘lyf’ with 11,900 points and ‘tyf’ with 3,456 points) come from the Informedia project [wks+96] at Carnegie Mellon University. Each face was processed with the eigenfaces method [tp91], resulting in 16-dimensional points.

Our  experiments  are  designed  to  answer  the  following  questions. 

•  How  often  do  real  datasets  follow  the  proposed  power  law? 

•  How  good  is  the  linear  fit? 

•  How  accurate  is  our  ‘box-occupancy-product-sum’  Lemma? 

• What are the effects of sampling and affine transformations on them?

• How fast is the BOPS method, compared to other estimations of $PC(r)$?


Figure 9 - PC plots and the pair-count exponents $\mathcal{P}$ of geographical data. First row: Galaxy datasets, (a) cross join of ‘dev’ and ‘exp’, (b) self join of ‘dev’, (c) self join of ‘exp’. Second row: California datasets, (d) cross join of CA-pol and CA-wat, (e) self join of CA-pol, (f) self join of CA-wat.
5.1. Accuracy of ‘PC’ Law

We  present  our  experiments  in  two  groups,  two-dimensional  geographical  datasets  (California  and  Galaxy 
data),  and  higher-dimensionality  ones  (Iris,  Eigenfaces). 

5.1.1  -  Geographical  datasets 

The immediate application for the pair-count exponent is to estimate the selectivities for cross spatial joins. Thus, the natural candidates to show that this method works are geographical datasets. Figure 9 shows the pair-count exponent for California and Galaxy datasets, and it can be seen that the PC plots are linear for a suitable range of r. The slopes of the fitting lines are also shown, and these give us the pair-count exponent that will be used to estimate the selectivities in cross or self joins.

5.1.2  -  Higher  Dimensional  datasets 

Figure 10 presents the PC-plots, the fitting lines and the pair-count exponent $\mathcal{P}$ for the Eigenfaces datasets, which are 16-dimensional data. It can be seen that our power law remains quite accurate for high-dimensional datasets. Recurring conclusions from all the above experiments are:

1. The linear fit implied by our ‘pair-count’ law is extremely precise, for a wide variety of diverse datasets.

2. For self-joins, as well as for cross-joins, the correlation coefficient of the fit is at least 0.995 (where 1 is the value of perfect linear correlation).

3. Especially for the high-dimensional datasets, the self-join exponent is significantly lower than the embedding dimensionality of the data. For example, in Eigenfaces, the intrinsic dimensionality is between 4.5 and 6.7 (values of $\mathcal{P}$ vary from 4.49 for the self-join of ‘lyf’ to 6.73 for the cross-join of ‘tyf’ and ‘lyf’), while the embedding dimensionality E was 16. This implies that these 16-dimensional points are not even close to being uniformly distributed (if they were, then $\mathcal{P}$ = 16). Thus, any analysis making the uniformity assumption will be very inaccurate, since the dimensionality of the data ($\mathcal{P}$ or E) is in the exponent!
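
To see how strongly the exponent matters, consider what Law 1 predicts when the radius is halved. The arithmetic below is an illustration using the exponents quoted above, not an experiment from the paper:

```latex
\begin{align*}
\frac{PC(r/2)}{PC(r)} &= 2^{-\mathcal{P}} \\
\mathcal{P} = 16: \quad 2^{-16} &= 1/65536 \\
\mathcal{P} = 4.49: \quad 2^{-4.49} &\approx 1/22.5
\end{align*}
```

After a single halving of the radius, the two predictions already differ by a factor of $2^{16 - 4.49} \approx 2900$.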


Figure 10 - PC plots and the pair-count exponent $\mathcal{P}$ of the Eigenfaces datasets: (a) self join of the ‘lyf’ dataset, (b) self join of the ‘tyf’ dataset, (c) cross join of the ‘lyf’ and ‘tyf’ datasets.


5.2.  Sampling 

We present further experiments in order to illustrate Observation 3, which states that PC plots are invariant to sampling. Figure 11 presents the pair-count exponents obtained from PC plots (points) and BOPS plots (lines). All plots are clearly parallel. Table 2 shows the results for the Galaxy and California datasets when the pair-count exponent was calculated for self-joins. Sampling clearly has negligible effects on the PC exponent. Table 3 shows the results for the same datasets using the pair-count exponent obtained from PC plots and from BOPS plots.

Conclusions  from  the  above  experiments  are  as  follows. 

1. The pair-count exponent $\mathcal{P}$ is practically unaffected by sampling, for reasonable sample sizes (e.g., equal to or higher than 10%).

2. Whatever the sampling rate, the corresponding BOPS plot on the samples is very close to the pair-count plot of the samples. This means that whatever the time that sampling can save, BOPS applied on the samples will outperform, with practically the same accuracy.

The estimate of $\mathcal{P}$ obtained from BOPS has a relative error that is practically always less than 5%. Only when the sampled size of a dataset is very small does the BOPS plot result in a 9% error; even 9% is a reasonable value.


Figure 11 - PC-plots and corresponding BOPS plots for (a) California datasets; (b) Galaxy datasets. Both plots are shown for the full datasets and three levels of sampling. Horizontal axis: log(dist).


Sampling rate   Galaxy dev   Galaxy exp   California pol   California wat   California str
100%            1.876        1.928        1.650            1.529            1.838
20%             1.875        1.932        1.643            1.562            1.701
10%             1.873        1.952        1.631            1.694            1.661
5%              1.880        2.146        1.515            1.711            1.623

Table 2: The pair-count exponents $\mathcal{P}$ for samples of Galaxy (‘dev’ and ‘exp’) and California (CA_pol, CA_wat and CA_str) datasets for self-joins.
                Galaxy dev x exp        California pol x wat    California pol x str
Sampling rate   from PC    from BOPS    from PC    from BOPS    from PC    from BOPS
100%            1.915      1.963        1.835      1.819        1.783      1.743
20%             1.915      1.963        1.833      1.825        1.776      1.759
10%             1.902      1.965        1.839      1.816        1.783      1.715
5%              1.918      1.736        1.856      1.786        1.752      1.725

Table 3: The pair-count exponent $\mathcal{P}$ values (from PC plots and from BOPS plots) for joins on sampled data from Galaxy (‘dev’ and ‘exp’) datasets and also on California_pol, California_wat, California_str datasets.


5.3.  Accuracy  of  Selectivity  Estimations 

We have seen that the pair-count law is obeyed (Figures 9 and 10). We have also just seen (Figure 11 and Table 3) that our BOPS Lemma leads to very close approximations of the pair-count exponent. The question now becomes how precise the selectivity estimate PC(r) can be when using

(a) our Law 1 and

(b) our estimates from BOPS.

Table 4 shows the relative error of the estimated selectivities, calculated as

$$\frac{PC(r) - \widehat{PC}(r)}{PC(r)}$$

where $\widehat{PC}(r)$ denotes the estimated pair count, and we report the geometric average over several values of r. The top row estimates PC(r) as follows.

Step (a): Compute the PC plot.

Step (b): Fit the line to obtain the estimation.

To measure the relative error in estimating the selectivities of queries, we compared the estimates produced with each pair-count exponent method against the actual figures given by Law 1. Table 4 presents the geometric average of the relative error obtained when the pair-count exponent $\mathcal{P}$ is taken from the PC plot and from the BOPS plot.
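The procedure behind the top row of Table 4 can be sketched in a few lines. This is a minimal illustration, not the authors' C++ implementation: the function names, the brute-force pair counting and the use of the absolute value in the relative error are assumptions made for the example, and the data in the usage part are synthetic.

```python
import numpy as np


def pair_count(a, b, r):
    """Brute-force PC(r): pairs (p in a, q in b) at Euclidean distance <= r."""
    diff = a[:, None, :] - b[None, :, :]        # shape (len(a), len(b), dims)
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return int((dist <= r).sum())


def fit_power_law(radii, counts):
    """Fit log PC = log K + P * log r by least squares; return (K, P)."""
    slope, intercept = np.polyfit(np.log(radii), np.log(counts), 1)
    return float(np.exp(intercept)), float(slope)


def geometric_mean_relative_error(actual, estimated):
    """Geometric average of |actual - estimated| / actual over all radii."""
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    rel = np.abs(actual - estimated) / actual
    rel = np.maximum(rel, 1e-12)                # avoid log(0) on exact hits
    return float(np.exp(np.log(rel).mean()))


if __name__ == "__main__":
    rng = np.random.default_rng(0)              # synthetic, uniform 2-d points
    a, b = rng.random((400, 2)), rng.random((400, 2))
    radii = np.array([0.02, 0.04, 0.08, 0.16])
    counts = np.array([pair_count(a, b, r) for r in radii])
    K, P = fit_power_law(radii, counts)         # Step (a) and Step (b)
    estimates = K * radii ** P                  # Law 1: PC(r) = K * r^P
    print(P, geometric_mean_relative_error(counts, estimates))
```

The brute-force `pair_count` is quadratic in the number of points, which is the cost that the BOPS plot is meant to avoid.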


Columns recovered: Galaxy (dev x exp, dev x dev, [illegible]); California (pol x pol, wat x wat)

PC plot estimation:    0.02  0.01  0.02  0.02  0.02  0.06
BOPS plot estimation:  0.13  0.24  0.30  0.34
Table 4 - Geometric average of the relative error of selectivity estimation.

Datasets                                  PC-Plot           BOPS
                                          (time in sec.)    (time in sec.)
California   pol x wat (100% of data)     7,752.50          3.44
             pol x wat (10% of data)      73.36             0.5
             str x rai (100% of data)     4,434.27          2.55
             str x rai (10% of data)      42.64             0.47
             pol x str (100% of data)     7,664.28          3.44
             pol x str (10% of data)      66.58             0.53
Galaxy       dev x exp (100% of data)     13,078.38         5.27
             dev x exp (10% of data)      126.98            0.72
Iris         setosa x virginica           5.32              0.01
             virginica x versicolor       4.98              0.01

Table 5 - Clock time in seconds to obtain the pair-count exponent by PC-plots and BOPS-plots.


5.4.  Timing  results 

The question now becomes: (a) how long it takes to estimate the pair-count exponent with the PC plot and (b) how long it takes to obtain the estimate from the BOPS plot. Table 5 reports the wall-clock times for each plot on an Intel Pentium II 450 MHz machine running Windows NT. Both methods were implemented in C++.

We can see in Table 5 that there is a huge difference in running time between the PC plots and the BOPS plots. Calculating the pair-count exponent with the BOPS method saves orders of magnitude. Moreover, BOPS plots give a fast and accurate approximation of $\mathcal{P}$. Sampling also gives a close approximation of $\mathcal{P}$, but it is much more time-consuming, because the whole dataset must be scanned to generate the sample before the PC plot can be applied. Even when the PC plot is applied to a 10% sample (the limit that preserves the accuracy of the estimation), BOPS on the whole dataset remains much faster: for Galaxy dev x exp, BOPS takes 5.27 seconds on the whole dataset, versus 2.11 minutes for the PC plot on a 10% sample.
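The gap between the two columns of Table 5 comes from the kind of work each plot requires: the PC plot compares every pair of points, while a box-occupancy computation needs only one pass over each dataset per grid resolution. The sketch below assumes, following the name Box-Occupancy-Product-Sum, that BOPS(r) is the sum over grid cells of side r of the product of the number of points each dataset has in that cell. The function names and the hash-map grid are illustrative choices, not the authors' C++ code.

```python
from collections import Counter

import numpy as np


def box_occupancy_product_sum(a, b, r):
    """Sum over grid cells of side r of (#points of a) * (#points of b)."""
    # Map every point to the integer coordinates of its grid cell.
    cells_a = Counter(map(tuple, np.floor(a / r).astype(int)))
    cells_b = Counter(map(tuple, np.floor(b / r).astype(int)))
    # Only cells occupied by both datasets contribute to the sum.
    return sum(n * cells_b[cell] for cell, n in cells_a.items())


def bops_exponent(a, b, radii):
    """Slope of log BOPS(r) versus log r, used as the exponent estimate."""
    values = [box_occupancy_product_sum(a, b, r) for r in radii]
    slope, _intercept = np.polyfit(np.log(radii), np.log(values), 1)
    return float(slope)


if __name__ == "__main__":
    rng = np.random.default_rng(0)              # synthetic, uniform 2-d points
    a, b = rng.random((2000, 2)), rng.random((2000, 2))
    print(bops_exponent(a, b, [1 / 2, 1 / 4, 1 / 8, 1 / 16]))
```

Each call touches every point once, so the cost per grid resolution grows linearly with the dataset sizes rather than with their product.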

Table 5 reports the times needed to build each plot for several pairs of datasets. It also shows the times obtained when only samples are fed into the two algorithms. The sampling rate is reported on each row, and it is the same for both datasets. The observations are the following:

1). Our BOPS method is more than three orders of magnitude faster on the large datasets (for example, 13,078.38 versus 5.27 seconds for Galaxy dev x exp).

2). In fact, BOPS on the full sets is still faster than the PC plots on the samples (10% sampling rate), by roughly 20 times. Thus, we conclude that the BOPS plot is a fast and accurate tool for selectivity estimation of spatial joins.
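The two observations can be checked directly against Table 5. The snippet below copies the full-data and 10%-sample times for the California and Galaxy pairs from the table and computes the two ratios being discussed. The dictionary layout and the function name are illustrative assumptions.

```python
# Times in seconds copied from Table 5:
# pair -> (PC plot full, BOPS full, PC plot 10% sample, BOPS 10% sample)
TABLE_5 = {
    "California pol x wat": (7752.50, 3.44, 73.36, 0.50),
    "California str x rai": (4434.27, 2.55, 42.64, 0.47),
    "California pol x str": (7664.28, 3.44, 66.58, 0.53),
    "Galaxy dev x exp": (13078.38, 5.27, 126.98, 0.72),
}


def speedups(times):
    """Return {pair: (PC full / BOPS full, PC 10% sample / BOPS full)}."""
    return {
        pair: (pc_full / bops_full, pc_sample / bops_full)
        for pair, (pc_full, bops_full, pc_sample, _bops_sample) in times.items()
    }


if __name__ == "__main__":
    for pair, (full_ratio, sample_ratio) in speedups(TABLE_5).items():
        print(f"{pair}: {full_ratio:.0f}x on full data, "
              f"{sample_ratio:.1f}x versus the PC plot on a 10% sample")
```

With these numbers the full-data ratios range from roughly 1,700 to 2,500, and BOPS on the full data is roughly 17 to 24 times faster than the PC plot on a 10% sample.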


6.  DISCUSSION 

Our discussion addresses two questions, which are

a) How often should we expect the ‘pair-count’ law to hold?

b)  How  can  we  use  it  to  do  other  extrapolations? 

6.1.  How  often? 

Power laws occur regularly in real datasets. In fact, our ‘pair-count’ law is obeyed by the self-join of any self-similar dataset, in which case the ‘pair-count’ exponent is exactly the correlation fractal dimension $D_2$ of that dataset. It is well known that the vast majority of real datasets are self-similar [bf 95]: examples include coastlines (fractal dimension 1.1-1.3), stock prices (fractal dimension = 1.5), rain patches (fractal dimension = 1.3) and the brain surface of mammals (fractal dimension = 2.6-2.1). As we have just seen, the same is true for the self-joins of our real datasets (1.9 for the GALAXY datasets, 1.5-1.8 for the CA datasets, 1.9-2.9 for the 4-dimensional IRIS datasets, and 4.5-5.4 for the 16-dimensional Eigenfaces datasets).


6.2.  Other  extrapolations 


There is a wealth of estimations that we can perform whenever a pair of real datasets obeys the pair-count law, by exploiting the law together with the invariant properties of the pair-count exponent $\mathcal{P}$. One extrapolation is to estimate the minimum distance $r_{min}$, that is, the distance between the closest pair of points. The formula is

$$PC(r_{min}) = 1 = K \cdot r_{min}^{\mathcal{P}} \qquad (11)$$

The justification follows directly from Law 1. We can also estimate the distance $r_c$ of the c-th closest pair, and the formula is

$$PC(r_c) = K \cdot r_c^{\mathcal{P}} \qquad (12)$$

where $PC(r_c) = c$, since exactly c pairs lie within distance $r_c$.

Additional extrapolations can be performed for subsets and supersets of the two original datasets, since the pair-count exponent $\mathcal{P}$ is not affected by sampling.
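Equations 11 and 12 can be solved for the distance in closed form. The following is a minimal sketch, assuming that K and the exponent have already been fitted and that PC(r_c) = c for the c-th closest pair; the function names are hypothetical.

```python
def closest_pair_distance(K, exponent):
    """Solve 1 = K * r**exponent for r (Equation 11)."""
    return (1.0 / K) ** (1.0 / exponent)


def cth_closest_pair_distance(c, K, exponent):
    """Solve c = K * r**exponent for r (Equation 12 with PC(r_c) = c)."""
    return (c / K) ** (1.0 / exponent)


if __name__ == "__main__":
    # Illustrative values only; K and the exponent would come from a fitted plot.
    K, exponent = 4.0, 2.0
    print(closest_pair_distance(K, exponent))
    print(cth_closest_pair_distance(16, K, exponent))
```

Both functions run in constant time once K and the exponent are known, in line with the constant-time claim made for selectivity estimation.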


7.  CONCLUSIONS 

The main contribution of this work is the identification of a power law, namely the ‘pair-count’ law. This is the first and only published law that governs the distribution of pair-wise distances between two real, n-dimensional point-sets. This law leads to the estimation of spatial join selectivities through a simple formula, which is extremely accurate, with less than 9% error. Given the pair-count exponent $\mathcal{P}$, the selectivity estimations can be performed in constant time (O(1)) without the need for sampling or any other costly operations.

Additional contributions include the following:

• The identification of several invariant properties of the pair-count exponent $\mathcal{P}$: it is invariant to rotation, translation, scaling and sampling. Moreover, this holds for any Lp norm.

• Efficiency issues: the introduction of the BOPS concept (box-occupancy-product-sum). It allows a fast estimation of the pair-count exponent $\mathcal{P}$. Its response time is orders of magnitude better than the straightforward estimation using the pair-count function PC(r). Thanks to the BOPS plot, the whole concept of the pair-count exponent becomes practical. In fact, our method used on the full sets is still significantly faster than the PC plots on samples.

• Experiments on many, diverse datasets. The experiments show that (a) the pair-count law holds for a surprisingly large number of real datasets and (b) our BOPS approximation is highly accurate. The error is less than 9% for the pair-count exponent $\mathcal{P}$ and less than 35% for the selectivity estimation.

Future research could focus on the discovery of additional power laws in real, spatial datasets, as well as on explaining the reasons why these laws hold.